A new article created using the Distill format.
This take-home exercise aims to sharpen the skill of building data visualization programmatically using ggplot2 by exploring the insights of participants dataset.The data visualizations included in this exercise are: - Create a pareto chart showing the distribution of age group. - Create split violin plots showing the distribution of joviality by educationLevel and haveKids viarables.
The packages required are tidyverse (included relevant packages for data analyses such as ggplot2, readr and dplyr), ggrepel,patchwork and knitr.
The code chunk below is used to install and load the required packages onto RStudio.
The code chunk below import Participants.csv from data
folder into R by using read_csv()
of readr
package and save it as an tibble data frame called
participants.
participants <- read_csv("data/Participants.csv")
glimpse(participants)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
The table above shows that there are 7 columns 1011 rows in the dataset.
Before dig deep into the data, we use ggplot to visualize the distribution of all the columns.
p_householdsize <- ggplot(data = participants,
aes(x = householdSize))+
geom_bar(color="grey25",
fill="grey90") +
ggtitle("Distribution of Household Size")
p_havekids <- ggplot(data=participants,
aes(x = haveKids)) +
geom_bar(color="grey25",
fill="grey90") +
ggtitle("Distribution of Have Kids")
p_age <- ggplot(data=participants,
aes(x = age)) +
geom_histogram(boundary = 100,
color="grey25",
fill="grey90") +
coord_cartesian(xlim=c(16, 61)) +
ggtitle("Distribution of Age")
p_edu <- ggplot(data=participants,
aes(x = educationLevel)) +
geom_bar(boundary = 100,
color="grey25",
fill="grey90") +
ggtitle("Distribution of Education Level")
p_interest <- ggplot(data=participants,
aes(x = interestGroup)) +
geom_bar(boundary = 100,
color="grey25",
fill="grey90") +
ggtitle("Distribution of Interest Group")
p_joviality<- ggplot(data=participants,
aes(x = joviality)) +
geom_density() +
ggtitle("Distribution of Joviality")
(p_householdsize/p_havekids/p_interest)|(p_age/p_edu/p_joviality)
Figures above show that the data of householdSize, interestGroup and age are equally distributed. For other viarables, we need to explore it further.
Let’s have a look at the relationship between variables.
ggplot(data = participants,
aes(x = educationLevel, fill = haveKids))+
geom_bar()
Figure above shows that higher the education leval, lower possibility to have kids.
ggplot(data = participants,
aes(x = age, fill = haveKids))+
geom_bar()
We wanted to further explore the distribution of age, education, and Joviality, so we decided to use two more complex charts, that’s pareto chart of age group and violin plots of joviality by educationLeval and haveKids.
Pareto charts show the ordered frequency counts of data. They show the ordered frequency counts of values for the different levels of a categorical or nominal variable. These charts are often used to identify areas to focus on first in process improvement, as supported by the Pareto Principle (80/20 Rule).
The following code shows how to perform data binning on the age variable using the cut() and mutate() function with specific break marks:
# A tibble: 1,011 x 8
participantId householdSize haveKids age educationLevel
<dbl> <dbl> <lgl> <dbl> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege
2 1 3 TRUE 25 HighSchoolOrCollege
3 2 3 TRUE 35 HighSchoolOrCollege
4 3 3 TRUE 21 HighSchoolOrCollege
5 4 3 TRUE 43 Bachelors
6 5 3 TRUE 32 HighSchoolOrCollege
7 6 3 TRUE 26 HighSchoolOrCollege
8 7 3 TRUE 27 Bachelors
9 8 3 TRUE 20 Bachelors
10 9 3 TRUE 35 Bachelors
# ... with 1,001 more rows, and 3 more variables:
# interestGroup <chr>, joviality <dbl>, ageGroup <fct>
To achieve the frequency count of age, group_by() of dplyr package is used to group data by age. Then, summarise() of dplyr is used to count (i.e. n()) the number of each age group.
dit_age<- part_age %>%
group_by(`ageGroup`) %>%
summarise('Counts'=n()) %>%
ungroup()
dit_age
# A tibble: 5 x 2
ageGroup Counts
<fct> <int>
1 (17,25] 184
2 (25,35] 243
3 (35,45] 226
4 (45,55] 244
5 (55,60] 114
By default, the values of the tibble data frame is sorted according to the values of the first column.
We will need to sort the age group by descending order of counts in the age group. To accomplish this task, the arrange() of dplyr package is used as shown in the code chunk below.
freq_sorted <- dit_age %>%
arrange(desc(Counts))
head(freq_sorted)
# A tibble: 5 x 2
ageGroup Counts
<fct> <int>
1 (45,55] 244
2 (25,35] 243
3 (35,45] 226
4 (17,25] 184
5 (55,60] 114
Lastly, we can compute the cumulative frequency of age group. This task will be performed by using mutate() of dplyr package and cumsum() of Base R.
The newly computed cumulative frequency values will be stored in a new field called cumfreq.
freq_cum <- freq_sorted %>%
mutate(cumfreq = cumsum(Counts)) %>%
mutate(cum = cumsum(Counts)/sum(Counts))
head(freq_cum)
# A tibble: 5 x 4
ageGroup Counts cumfreq cum
<fct> <int> <int> <dbl>
1 (45,55] 244 244 0.241
2 (25,35] 243 487 0.482
3 (35,45] 226 713 0.705
4 (17,25] 184 897 0.887
5 (55,60] 114 1011 1
A pareto chart was plotted using ggplot2 as follows:
geom_col() instead of geom_bar() was used to create the bar chart as we do not need to modify the data, and want the height of the bar to represent the actual counts of people in each age group.
geom_line() and geom_point() was used for the line and scatter plot for the cumulative frequency of return counts. The scatter plot helps reader to identify the corresponding cumulative frequency for people in each age group in a static chart.
scale_y_continuous() was used to adjust the interval between the grid lines and add a secondary y axes for the cumulative percentage of people counts for each age group. The secondary y axes is just based on a one-to-one transformation of the primary axes, hence both bar and line charts are still plotted with reference to the primary axes.
theme() was lastly used to adjust the background color and grid lines of the plot to improve visibility to readers.
pa <- ggplot(data = freq_cum,
aes(x = reorder(`ageGroup`,-`Counts`))) +
geom_col(aes(y=`Counts`), fill = 'lightblue', width= 0.8) +
geom_point(aes(y=`cumfreq`), color = 'grey20', size = 0.8) +
geom_line(aes(y =`cumfreq`, group = 1), colour = 'grey20', size = 0.4) +
labs(x = "ageGroup", title = "Pareto Chart of Age Group") +
scale_y_continuous(
name = 'Absolute Frequency', breaks = seq(0, 3500, 300), minor_breaks = seq(0, 3500, 100),
sec.axis = sec_axis(~.*1/sum(freq_cum$Counts), name = 'Cumulative Frequency', breaks = seq(0, 1, by = 0.1), labels = scales::percent)
)+
geom_hline(yintercept=0.8*sum(freq_cum$Counts), linetype="dashed", color = "grey50") +
geom_text(aes(3, 0.8*sum(freq_cum$Counts), label = "80.0%", vjust = -0.5), size = 2) +
theme(panel.background = element_rect(fill = 'white', colour = 'grey60', size = 0.5, linetype = 'solid'),
panel.grid.major = element_line(size = 0.3, linetype = 'solid', colour = 'grey85'),
panel.grid.minor = element_line(size = 0.2, linetype = 'solid', colour = 'grey90'),
text = element_text(size = 5.5),
axis.ticks.x = element_blank())
pa
4.2.2 Interactive Pareto Chart
From the previous chart plotted using ggplot2, the secondary y axes was based on a one-to-one transformation of the primary y axes. The bar and line charts were plotted with reference to the primary y axes, hence the secondary y axes will not be transferred to the interactive chart using ggplotly() of plotly library.
Therefore, an interactive pareto chart was plotted using plot_ly() from plotly. A scatter plot was not included in this case as one can simply hover across the line to view the corresponding cumulative percentage for each age group. Similarly, hovering on the bars will reveal the absolute frequency of people counts for each age group.
Call the library first:
plot_ly(freq_cum, x = ~reorder(`ageGroup`,-`Counts`), y = ~`Counts`, type = "bar", name = "Counts") %>%
add_trace(x = ~reorder(`ageGroup`,-`Counts`), y = ~`cum`*100,type = "scatter", mode = "lines", yaxis = "y2", name = "Cum. %") %>%
plotly::layout(title = "Pareto chart of ageGroup",
xaxis = list(title = "ageGroup"),
yaxis = list(title = "Counts (Absolute Frequency)", showgrid = F),
yaxis2 = list(overlaying = "y", side = "right", title = "Cumulative Percentage (%)", range = list(0, 100)),
legend = list(orientation="h", yanchor="bottom",y=0.9,xanchor="top",x=0.2))
From the pareto chart, we can tell that in the dataset, there are less people for people who are older than 55 years old and younger than 25 years old.The others take up g=for about 80%, which is
A new package [introdataviz] need to be installed to visualize the split violin plots.
devtools::install_github("psyteachr/introdataviz")
Code chunk below is used to create the split violin plots.
ggplot(part_age, aes(x = educationLevel, y = joviality, fill = haveKids)) +
introdataviz::geom_split_violin()
In order to more clearly show the distribution difference of Joviality among people with different education levels and people with or without children, box plot needs to be superimposed on the violin chart.
ggplot(part_age, aes(x = educationLevel, y = joviality, fill = haveKids)) +
introdataviz::geom_split_violin() +
geom_boxplot()
ggplot(part_age, aes(x = educationLevel, y = joviality, fill = haveKids)) +
introdataviz::geom_split_violin(alpha = .4, trim = FALSE) +
geom_boxplot(width = .2, alpha = .6, fatten = NULL, show.legend = FALSE) +
stat_summary(fun.data = "mean_se", geom = "pointrange", show.legend = F,
position = position_dodge(.175)) +
scale_y_continuous(breaks = seq(0, 1, 0.2),
limits = c(0, 1)) +
scale_fill_brewer(palette = "Dark2", name = "Have Kids")
From the plot, we can tell that:
For those who does not have kids, higher the education level, higher the joviality level.But for those who has kids, it’s not the case-group from the low education level have a higher joviality level.
The joviality differs the most for people from the lowest education group when comes to whether have kids affect the level of happiness.The median of those who comes from low education group and does not have kids is the lowest, the median of those who comes from low education group but have kids is the hightes.
The medians of groups except for low education are in a similar level no matter whether they have kids or not.